
    Topology-aware GPU scheduling for learning workloads in cloud environments

    Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems. This project is supported by the IBM/BSC Technology Center for Supercomputing collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef and Asser Tantawi for the valuable discussions. We also thank SC17 committee member Blair Bethwaite of Monash University for his constructive feedback on the earlier drafts of this paper. Peer Reviewed. Postprint (published version)
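
    As a rough illustration of what topology-aware placement involves, the sketch below scores candidate GPU sets by their pairwise link bandwidth and penalizes choices that would interfere with busy neighbors. The bandwidth table, device IDs, and scoring rule are assumptions made for illustration only; they do not reproduce the paper's algorithm.

        from itertools import combinations

        # Pairwise GPU-GPU link bandwidth in GB/s (hypothetical values:
        # fast NVLink-style pairs vs. slower cross-socket paths).
        LINK_BW = {
            frozenset({0, 1}): 80, frozenset({2, 3}): 80,
            frozenset({0, 2}): 16, frozenset({0, 3}): 16,
            frozenset({1, 2}): 16, frozenset({1, 3}): 16,
        }

        def placement_score(gpus, busy):
            """Sum of pairwise bandwidth, minus a penalty for splitting a fast pair."""
            bw = sum(LINK_BW[frozenset(p)] for p in combinations(gpus, 2))
            penalty = sum(10 for g in gpus for b in busy
                          if LINK_BW.get(frozenset({g, b}), 0) >= 80)
            return bw - penalty

        def place_job(num_gpus, free, busy):
            """Choose the free GPU set with the best topology score for this job."""
            candidates = combinations(sorted(free), num_gpus)
            return max(candidates, key=lambda c: placement_score(c, busy), default=None)

        print(place_job(2, free={0, 1, 2}, busy={3}))   # picks (0, 1), the fast pair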

    Objcache: An Elastic Filesystem over External Persistent Storage for Container Clusters

    Container virtualization enables emerging AI workloads such as model serving, highly parallelized training, and machine learning pipelines to be easily scaled on demand on elastic cloud infrastructure. In particular, AI workloads require persistent storage for data such as training inputs, models, and checkpoints. An external storage system like cloud object storage is a common choice because of its elasticity and scalability. To mitigate access latency to external storage, caching in a local filesystem is an essential technique. However, building local caches on scaling clusters must cope with explosive disk usage, redundant networking, and unexpected failures. We propose objcache, an elastic filesystem over external storage. Objcache introduces an internal transaction protocol over Raft logging to enable atomic updates of distributed persistent states with consistent hashing. The proposed transaction protocol can also manage inode dirtiness by maintaining consistency between the local cache and external storage. Objcache supports scaling down to zero by automatically evicting dirty files to external storage. Our evaluation reports that objcache sped up model serving startup by 98.9% compared to direct copies via S3 interfaces, and scaling up with 1024 dirty files completed in 2 to 14 seconds. Comment: 13 pages
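
    The following sketch illustrates two of the ideas highlighted above: consistent hashing to place cached file state across cache nodes, and eviction of dirty files to external storage when scaling down. It is not the objcache implementation or API, it omits the Raft-based transaction protocol entirely, and all names are illustrative.

        import bisect, hashlib

        def _h(key):
            return int(hashlib.sha1(key.encode()).hexdigest(), 16)

        class Ring:
            """Consistent-hash ring mapping file keys to cache nodes."""
            def __init__(self, nodes, vnodes=64):
                self._points = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
                self._hashes = [h for h, _ in self._points]
            def owner(self, key):
                i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
                return self._points[i][1]

        class CacheNode:
            """Local cache that tracks dirty files and flushes them before shutdown."""
            def __init__(self, name, object_store):
                self.name, self.store, self.dirty = name, object_store, {}
            def write(self, path, data):
                self.dirty[path] = data              # buffered locally, not yet durable
            def scale_to_zero(self):
                for path, data in self.dirty.items():
                    self.store[path] = data          # evict dirty files to external storage
                self.dirty.clear()

        ring = Ring(["node-a", "node-b"])
        store = {}                                   # stand-in for an S3-like object store
        node = CacheNode(ring.owner("/models/ckpt-1"), store)
        node.write("/models/ckpt-1", b"weights")
        node.scale_to_zero()
        print(sorted(store))                         # the dirty file now lives in the store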

    Towards dynamic adaptation of I/O scheduling in commodity operating systems

    Disk scheduling algorithms in operating systems are often designed to satisfy a primary application data delivery requirement. Multiple concurrent and conflicting requirements need to be satisfied to support concurrently executing applications. Accordingly, a disk scheduling algorithm that is designed to concurrently satisfy multiple data delivery requirements is crucial to the support of diverse architectures, disk systems, and workloads. To concurrently satisfy multiple data delivery requirements, we implement and evaluate a mechanism that dynamically switches between two policies in order to enforce two performance requirements. This two-policy adaptation strives to provide latency guarantees for requests at all times and, when all requests meet their latency requirements, it strives to simultaneously provide fairness in terms of number of requests. This dissertation describes the two-policy adaptation and demonstrates why, in some cases, it cannot simultaneously satisfy both performance requirements. The fairness of the number-of-requests metric is also disputed. Accordingly, we next leverage a fair queuing discipline and implement a fair scheduling algorithm that can be extended to satisfy multiple data delivery requirements concurrently. Our new scheduling strategy uses compensated disk-time as the resource-sharing metric and achieves fairness and predictable application performance. To the best of our knowledge, ours is the only I/O scheduling algorithm that provides predictability in application performance, one of the most important system requirements. In addition, this algorithm and the underlying queuing system are flexible enough to enforce a number of other performance requirements, such as request latencies, anticipation of requests, and service level objectives, while concurrently providing fairness. This dissertation also describes the impact of different resource-sharing metrics of conventional fair scheduling on application performance predictability, and presents the fairness properties and analytical and experimental evaluations of various fair scheduling algorithms.
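
    A minimal sketch of the fair-queueing idea described above: each application is charged the estimated disk time its requests consume, and the next request is taken from the backlogged application with the least accumulated disk time. The per-request cost estimates and the omission of the dissertation's compensated disk-time accounting are simplifying assumptions.

        class FairDiskScheduler:
            """Dispatch the next request from the application with the least disk time used."""
            def __init__(self):
                self.used = {}            # app -> accumulated disk time (ms)
                self.queues = {}          # app -> pending (request, estimated service ms)

            def submit(self, app, request, est_service_ms):
                self.used.setdefault(app, 0.0)
                self.queues.setdefault(app, []).append((request, est_service_ms))

            def dispatch(self):
                backlogged = [a for a, q in self.queues.items() if q]
                if not backlogged:
                    return None
                app = min(backlogged, key=lambda a: self.used[a])
                request, cost = self.queues[app].pop(0)
                self.used[app] += cost    # charge the app for the service it received
                return request

        sched = FairDiskScheduler()
        sched.submit("db", "read block 10", est_service_ms=4.0)
        sched.submit("backup", "read block 900", est_service_ms=12.0)
        print(sched.dispatch(), sched.dispatch())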

    An Implementation of the POMP Performance Monitoring Interface for OpenMP Based on Dynamic Probes

    OpenMP has emerged as the standard for shared memory parallel programming. Unfortunately, it does not provide a standardized performance monitoring interface with which users and tool builders could write portable libraries for performance measurement of OpenMP programs. In this paper we present an implementation of a performance monitoring interface for OpenMP, based on the POMP proposal, which is built on top of DPCL, an infrastructure for binary and dynamic instrumentation. We also present overhead measurements of our implementation and show examples of its use with two versions of POMP-compliant libraries.
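
    The sketch below conveys the general shape of such a monitoring interface: instrumentation probes fire enter/exit callbacks around parallel regions, and a tool library accumulates per-region time. The callback names and the Python form are purely illustrative; they are not the POMP or DPCL APIs.

        import time
        from collections import defaultdict

        class Monitor:
            """Accumulates inclusive wall-clock time per instrumented region."""
            def __init__(self):
                self.totals = defaultdict(float)
                self._starts = {}
            def region_enter(self, region_id):
                self._starts[region_id] = time.perf_counter()
            def region_exit(self, region_id):
                self.totals[region_id] += time.perf_counter() - self._starts.pop(region_id)

        monitor = Monitor()

        def instrumented(region_id, body):
            """Stand-in for probes inserted at region begin/end in the running binary."""
            monitor.region_enter(region_id)
            try:
                return body()
            finally:
                monitor.region_exit(region_id)

        instrumented("parallel_region_1", lambda: sum(range(1_000_000)))
        print(dict(monitor.totals))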

    Throttling I/O Streams to Accelerate File-IO Performance

    To increase the scale and performance of high-performance computing (HPC) applications, it is common to distribute computation across multiple processors. Often without realizing it, file I/O is parallelized with the computation. An implication of this is that multiple compute tasks are likely to concurrently access the I/O nodes of an HPC system. When a large number of I/O streams concurrently access an I/O node, I/O performance tends to degrade, impacting application execution time. This paper presents experimental results that show that controlling the number of file-I/O streams that concurrently access an I/O node can enhance application performance. We call this mechanism file-I/O stream throttling. The paper (1) describes this mechanism and demonstrates how it can be implemented either at the application or system software layers, and (2) presents results of experiments driven by the cosmology application benchmark MADbench, executed on a variety of computing systems, that demonstrate the effectiveness of file-I/O stream throttling. The I/O pattern of MADbench resembles that of a large class of HPC applications.
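
    A small sketch of what application-level stream throttling can look like: a counting semaphore caps how many tasks issue file I/O at once. The cap value and file layout are illustrative assumptions, not parameters from the MADbench experiments.

        import os, tempfile, threading

        MAX_CONCURRENT_STREAMS = 4                      # the throttling knob
        io_gate = threading.Semaphore(MAX_CONCURRENT_STREAMS)

        def write_output(task_id, payload, out_dir):
            with io_gate:                               # at most N streams hit the I/O node
                with open(os.path.join(out_dir, f"task-{task_id}.dat"), "wb") as f:
                    f.write(payload)

        out_dir = tempfile.mkdtemp()
        tasks = [threading.Thread(target=write_output, args=(i, b"x" * 4096, out_dir))
                 for i in range(32)]
        for t in tasks: t.start()
        for t in tasks: t.join()
        print(len(os.listdir(out_dir)), "files written, at most",
              MAX_CONCURRENT_STREAMS, "streams at a time")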

    Scalability Analysis of Job Scheduling using Virtual Nodes

    It is important to identify scalability constraints in existing job scheduling software as it is applied to next-generation parallel systems. In this paper, we analyze the scalability of job scheduling and job dispatching functions in the IBM LoadLeveler job scheduler. To enable this scalability study, we propose and implement a new virtualization method to deploy LoadLeveler clusters of different sizes with a minimal number of physical machines. Our scalability studies using this virtualization show that the LoadLeveler resource manager can comfortably handle over 12,000 compute nodes, the largest scale we have tested so far. However, our study shows that static resource matching in the scheduling cycle and job object processing during hierarchical job launching are two impediments to the scalability of LoadLeveler.
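
    The sketch below shows the general idea of node virtualization for scalability testing: many lightweight node agents register with a resource-manager stub so it sees a far larger cluster than the physical hardware provides. It is a generic illustration, not LoadLeveler's mechanism or configuration, and the class and message names are invented.

        import queue, threading

        class ResourceManagerStub:
            """Collects node registrations so scheduler-side behavior can be observed at scale."""
            def __init__(self):
                self.nodes = {}
                self.inbox = queue.Queue()
            def run(self, expected):
                while len(self.nodes) < expected:
                    name, state = self.inbox.get()       # registration / heartbeat message
                    self.nodes[name] = state

        def virtual_node(name, rm):
            rm.inbox.put((name, "idle"))                 # announce this virtual node as schedulable

        rm = ResourceManagerStub()
        NODES_PER_HOST, HOSTS = 256, 4                   # 1024 virtual nodes on a few machines
        agents = [threading.Thread(target=virtual_node, args=(f"host{h}-vn{i}", rm))
                  for h in range(HOSTS) for i in range(NODES_PER_HOST)]
        for a in agents: a.start()
        rm.run(expected=len(agents))
        for a in agents: a.join()
        print(f"resource manager registered {len(rm.nodes)} virtual nodes")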